Mikael did Assignment 1. Johannes did Assignment 2. Interpretations are done collaboratively.
This assignment revolves around network visualization of terrorist connections. Files trainData.dat and trainMeta.dat contain information about a network of the individuals involved in the bombing of commuter trains in Madrid on March 11, 2004. the names include are of those people suspsected of having participated and their relatives.
Use visNetwork package to plot the graph in which; strength of links variable to denote distance, nodes colored by bombing group, size of nodes proportional to number of connections, layout that optimimizes repulsion forces, all nodes that are connected to a curretnyl selected node by a path of one are highlighted.
Analyse the obtainted network, in particular describe which clusters you see in the network.
# Reading data
links <- read.delim("trainData.dat", sep = "", header = FALSE)
nodes <- read.delim("trainMeta.dat", sep = "", header = FALSE)
#links <- read.delim("/home/mikmo937/Desktop/VISUALIZATION/Lab5/trainData.dat", sep = "", header = FALSE)
#nodes <- read.delim("/home/mikmo937/Desktop/VISUALIZATION/Lab5/trainMeta.dat", sep = "", header = FALSE)
# Change column names so visNetwork recognizes the variables
colnames(nodes) <- c("label", "group")
colnames(links) <- c("from", "to", "strength")
# Change formatting of id variables to not be numeric and prefix single digits with 0 to have the same length
nodes$id <- 1:nrow(nodes)
nodes$id <- sprintf("%02d", as.integer(sub("s", "", nodes$id)))
links$from <- sprintf("%02d", as.integer(sub("s", "", links$from)))
links$to <- sprintf("%02d", as.integer(sub("s", "", links$to)))
nodes$id <- paste0("s",nodes$id)
links$from <- paste0("s",links$from)
links$to <- paste0("s",links$to)
# Set the id variable as first column so igraph recognizes it as an id variable
nodes <- nodes[,c(3,1,2)]
net <- graph_from_data_frame(d= links, vertices = nodes, directed = T)
#visIgraph(net)
# Create column that counts the number of connections and name it so visNetwork recognizes it as sizing variable
nodes$value <- strength(net)
# Change name of groups to participation or not and to factor class to correctly color the graph accordingly
nodes$group <- ifelse(nodes$group == "1", "Participation", "No participation")
nodes$group <- as.factor(nodes$group)
# Use strength of links variable to assign distance of links between nodes
links$width <- links$strength
graph1 <- visNetwork(nodes, links) %>% visNodes(labelHighlightBold = TRUE) %>% visPhysics(solver = "repulsion") %>% visLayout(randomSeed = 13) %>%
visLegend() %>% visGroups(groupname = "Participation", color = "orange", shadow = list(enabled = TRUE)) %>% visGroups(groupname = "No participation", color = "purple")
graph1
Graph network of possible bomb participators
There seems to be 3 clusters. Two clusters of people not having participated in bombings, and one cluster of people having participated. The cluster with people having participated have larger nodes and therefore more connections, indicating that there are few people that actually perform the bombings and are connected to multiple others. They also are in the middle of the other clusters which means they are connected to both sides. The two other clusters are separate from each other which indicates they don’t contact each other very much. They also have smaller nodes which tells us there are fewer connections made.
An argument could be made that there are more than 3 clusters in that there are several isolated “islands” of nodes, which only seem to have connections to eachother and expanding their connections via single person that has connections with broader groups.
Add a functionality to the previous plot that highlights all nodes that are connected to the selected node by a path of length one OR two. Check some amount of the largest nodes and comment which individual has the best opportunity to spreda the information in the network. Read some information about this person in Google and present your findings.
graph2 <- graph1 %>% visOptions(highlightNearest = list(enabled = TRUE, algorithm = "hierarchical",
degree = list(from = 0, to = 2)))
graph2
Graph network over possible bombers, double highlighted nodes
The two largest nodes are of two persons that has participated in bombings, Jamal Zougam and Mohamed Chaoui. They are very close to each other and highlighting their nodes gives access to the same networking clusters, albeit via different edges. They both have great reach in the network and can easily spread information to the majority, only a small group is not connected to either of them as a one- or two-way connection.
One big node in the non-participation group is Imad Eddin Barakat in the bottom left. His node is not as large as the previous ones, but he almost has as big of a reach even though he has never participated in bombings. Perhaps this means he is more of a person spreading information and planning the bombings instead of actually participating, which also could mean his reach is significant. Naima Oulad Akcha in the top right is a non-participator that has very long reach in his network. He seems to have a very significant reach aswell.
Something interesting is the cluster in the top left, which is a rather large amount of people that barely has any connections,and only have contact with one bomb participator that has a small amount of connections as well. These seem very secluded from the others, being the only group that has no contact with either of the major bomb participators.
We will look up Imad Eddin Barakat as he is possible a “spider in the web” where he has a lot of contacts and spread but without participating in the bombings. According to Wikipedia his nickname is Abu Dahdah and is a Syrian-born Spaniard who was sentenced to a 27-year prison term in Spain, however it was not for the train bombings in Madrid, but for his part in September 11 attacks and for his membership in al-Qaeda. However, a spanish intelligence officer reported to the Spanish parliament that the Madrid train bombings had been “Abu Dahdah, without any doubt”. It was alleged that he helped direct the operations during prison visits by a contact. According to this information the graph network seems to have relayed information well regarding who could be possibly involved and connected to the bombings.
Computer clusters by optimizing edge betweenness and visualized the resulting network. Comment whether the clusters you identified manually in step 1 were also disocvered by this clustering method.
# Create copy of current nodes and create a graph
nodes1 <- nodes # create copy of current nodes
net1 <- graph_from_data_frame(d = links, vertices = nodes1, directed = F)
# Identify clusters by computing edge betweenness which is a score of number of shortest paths through an edge
ceb <- cluster_edge_betweenness(net1)
# Create column of the new found clusters in the nodes dataframe
nodes1$group <- ceb$membership
# Create graph from it
graph3 <- visNetwork(nodes1, links) %>% visIgraphLayout()
graph3
Graph network with edge betweenness as distance
Imad Eddin Barakat (Abu Dahdah) who we looked up, now has a bigger node value and many connections to other people, so he seems to have significant reach in networking. However he does not seem to have reach to other networks so his reach is with his own so to say. The two people who were the highly connected participators, Jamal Zougam and Mohamed Chaoui, also have big nodes and in the middle of a big cluster which many contacts. Naima is not as prominent although his reach seems to reach people with different coloring i.e. different clusters or groups.
In general the first produced clusters, especially the hierarchical one with 2 step highlighting seem to have done a good job of visualizing the data in a way that was possible to denote important network connections.
Use adjacency matrix representation to perform a permutation by HC seriation method and visualize the graph as a heatmap. Find the most pronounced cluster and comment whether this cluster was discovered in steps 1 or 3.
# Return adjacency matrix of our most recently produced graph
netm <- get.adjacency(net1, attr="strength", sparse=F)
colnames(netm) <- V(net)$label
rownames(netm) <- V(net)$label
# Create ditance matrix from adjacency matrix
rowdist<-dist(netm)
# Order the distance matrix after HC cluster and save the order permutated indices as a vector
order1<-seriate(rowdist, "HC")
ord1<-get_order(order1)
# Reorder the adjacency matrix accordingly
reordmatr<-netm[ord1,ord1]
# PLot heatmap
plot_ly(z=~reordmatr, x=~colnames(reordmatr),
y=~rownames(reordmatr), type="heatmap") %>% layout(xaxis = list(title = "Suspected terrorist"),
yaxis = list(title = "Suspected terrorist"))
Heatmap of the suspected bomber participators
The most prominent cluster group in the top right is composed of 12 people. Three of the identified people; Imad Eddin Barakat, Jamal Zougam, and Mohamd Chaoui are all in this cluster. The fourth person, Naimad Oulad Akcha is situated on the row just below the line. This seems to indicate that the pronounced cluster could be identified by the cluster in step 1. However, the step 3 cluster was alot more cluttered and would probably require more spent time and attentiveness to recognize the significant persons without “finding” by referencing them from step 1 which we did. Due to this the step 1 cluster with 2-step highlighting is more preferred, however the step 3 graph could be optimized with colors, arranging of nodes and whatnot to improve the information it relays.
This assignment revolves around visualizing time series data about the consumption of oil (million tonnes) and coal (million tonnes oil equivalents) in China, India, Japan, US, Brazil, UK, Germany and France. Marker size shows how large a country is (1 for china and the US, 0.5 for all other countries).
Visualize data in Plotly as an anmiated bubble chart of Coal versus Oil in which the bubble size correpsonds to the country size. List several noteworthy features of the investigated animation.
# reading the data
#coaloil <- read.csv2("/home/mikmo937/Desktop/VISUALIZATION/Lab5/Oilcoal.csv")
coaloil <- read.csv2("Oilcoal.csv")
# removing the empty column
coaloil <- coaloil[,-6]
plot_ly(coaloil, x=~log(Coal), y=~log(Oil), frame =~Year,size = ~Marker.size,color=~Country)%>%
add_markers(hoverinfo = 'text',text = paste(coaloil$Country,'\nOil = ', coaloil$Oil,
'\nCoal = ', coaloil$Coal))%>%
layout(title = 'Bubble chart',legend = list(title=list(text='Countries'))) %>% animation_opts(
150, easing = "linear", redraw = FALSE )# making the bubbles move more smoothly and linear
Animated bubble chart for all countries’ coal and oil consumption from 1965-2009
Both axis are log of the actual value so you can see the smaller values better, if you hover over the bubbles you get access to the actual values of the variables. The graph makes you see how the relationship between coal and oil consumption change for each year for each country and between each country. The size of the bubbles are showing if the country is considered a big consumer in any of the variables.
India, China and Brazil all have similar pattern in monotonic increase of consumption for both coal and oil. The three european countries, UK, Germany and France, all have a decreasing usage of coal and a rather similar consumption of oil. Both Japan and the US have an increase in usage both not as steep as the others.
Find two countries that had similar motion patterns and create a motion chart including these countries only. Try to find historical facts that could explain some of the sudden changes in the animation behavior.
After identifying the charts, China and India seem to follow the same pattern as they first move upwards on the y axis and later on right on the x axis.
rows <- c(which(coaloil$Country == 'China'),which(coaloil$Country == 'India'))
plot_ly(coaloil[rows,], x=~log(Coal), y=~log(Oil), frame =~Year,size = ~Marker.size,color=~Country)%>%
add_markers(hoverinfo = 'text',text = paste(coaloil$Country[rows],'\nOil = ', coaloil$Oil[rows],
'\nCoal = ', coaloil$Coal[rows]))%>%
animation_opts(
100, easing = "linear", redraw = F)%>%
layout(title = 'Bubble chart for China and India',legend = list(title=list(text='Countries')))
Animated bubble chart for coal and oil consumption in China and India from 1965-2009
Now you can easily see that the 2 countries follow a similar pattern in the consumptions of coal and oil over the years.
One historical fact that could explain these sudden changes is the steady linear population boom for both countries. Indias population grew from 500 million to 1.22 billion, and Chinas grew from 723 million to 1.34 billion. These are both very dramatic increases in absolute and relative value, and naturally showcases a definite need for more energy and heat to sustain the population.
Compute a new column that shows the proportion of fuel consumption related to oil, \(Oil_p = \frac{Oil}{Oil+Coal}*100\). Visualize the proportions by means of animated bar charts using Plotly. Perform an analysis of this animation. What are the advantages of visualizing data in this way compared to the animated bubble chart? What are the disadvantages?
# the calculations
coaloil$OILp <- (coaloil$Oil/(coaloil$Oil + coaloil$Coal)) * 100
# a
# Making the new dataframe
new_df <- rbind(coaloil,coaloil)
new_df <- new_df[order(new_df$Country,new_df$Year),] # ordering the data so the next step will be correct
# every second row have a 0 on the new variable
new_df$OILp <- replace(new_df$OILp, seq_along(new_df$OILp)%%2==0,0 )
# b
p <- plot_ly() %>% animation_opts(frame = 120, redraw = TRUE)%>% layout(
title = 'Proportion of fuel consumption by Country',
xaxis = list(title = 'Proportion of fuel consumption'),
yaxis = list(title = ''),
showlegend = FALSE
)
# looping in lines for every country in the data
for (country in unique(new_df$Country)) {
trace <- new_df[new_df$Country == country,]
p <- p %>% add_lines(data = trace, x = ~OILp, y = ~Country,frame = ~Year, name = country, line = list(width = 25))
}
p
Animated bar chart of proportional fuel consumption in all countries from 1965-2009
Most of the countries have an increase in the proportion of fuel consumption over the years, especially UK, France, China and Germany. Japan and US goes against this pattern and have about the same or a decrease in the proportion over the time span.
In the beginning of the 80´s UK have a quick increase of the proportion for a year and then return to more normal proportions. It may be interesting to analyse what happend that year.
As this graph is over a proportion variable its much easier to understand the actual proportion compared to the bubble graph, the information about how much consumption the countries have had is gone in this graph, a high proportion doesn’t say anything about the actual consumption.
The advantages of this plot is you require less attentiveness to analyze the difference for one country to another and definite values are easier to see. Disadvantage is that only one variable is visualized so patterns relating the variables to each other are harder to see, i.e the double axis is an advantage of the bubble scatterplot.
Repeat the previous step but use “elastic” transition (easing). Which advantages and disadvantages can you see with this animation? Use information in https://www.easings.net/ to support your arguments.
p <- plot_ly() %>% animation_opts(
120, easing = "elastic", redraw = FALSE )%>%
layout(
title = 'Proportion of fuel consumption by Country',
xaxis = list(title = 'Proportion of fuel consumption'),
yaxis = list(title = ''),
showlegend = FALSE
)
# looping in lines for every country in the data
for (country in unique(new_df$Country)) {
trace <- new_df[new_df$Country == country,]
p <- p %>% add_lines(data = trace, x = ~OILp, y = ~Country,frame = ~Year, name = country, line = list(width = 25))
}
p
Animated bar chart of proportional fuel consumption in all countries from 1965-2009, elastic easing
Disadvantage The elastic function is probably better to use on other variables than year if you want a smooth motion, as a year start and stops instantly.
Advantage The elastic easing adds a more realistic motion between the years, as its probably not as smooth in real life as the graph in task 2.3 shows.
Use Plotly to create a guided 2D-tour visualizing Coal consumption in which the index function is given by Cnetral Mass index and in which observations are years and variables are different countries. Find a projection with the most compact and well-separated clusters. Do clusters correspond to different Year ranges? Which variable has the largest contribution to this projection? How can this be interpreted?
At around step 1.53 it looks like there could be the step with the most compact and well-separated clusters. When you hover over the points you can see that in each cluster there are nearby years, one cluster is the 60s, 70s to 1984, one is the end of the 80s and the last one is the 90s.
China has the largest contribution to the projection as its looks like the variable that separates the data points the most. China has the biggest growth and increase in coal consumption for all countries over the years as their consumption interval spans from around 125 in 1967 to 1530 million tonnes in 2009.
# setting graphs to be 1:1.5 in size in global options
knitr::opts_chunk$set(echo = TRUE, fig.width = 6,fig.height = 4, fig.align = 'center', warning = FALSE, message = FALSE, out.width = "80%")
set.seed(123)
# load necessary libraries
library(ggraph)
library(igraph)
library(visNetwork)
library(dplyr)
library(plotly)
library(seriation)
library(devtools)
library(reshape2)
#install_version("tourr", "0.5.5")
library(tourr)
# Reading data
links <- read.delim("trainData.dat", sep = "", header = FALSE)
nodes <- read.delim("trainMeta.dat", sep = "", header = FALSE)
#links <- read.delim("/home/mikmo937/Desktop/VISUALIZATION/Lab5/trainData.dat", sep = "", header = FALSE)
#nodes <- read.delim("/home/mikmo937/Desktop/VISUALIZATION/Lab5/trainMeta.dat", sep = "", header = FALSE)
# Change column names so visNetwork recognizes the variables
colnames(nodes) <- c("label", "group")
colnames(links) <- c("from", "to", "strength")
# Change formatting of id variables to not be numeric and prefix single digits with 0 to have the same length
nodes$id <- 1:nrow(nodes)
nodes$id <- sprintf("%02d", as.integer(sub("s", "", nodes$id)))
links$from <- sprintf("%02d", as.integer(sub("s", "", links$from)))
links$to <- sprintf("%02d", as.integer(sub("s", "", links$to)))
nodes$id <- paste0("s",nodes$id)
links$from <- paste0("s",links$from)
links$to <- paste0("s",links$to)
# Set the id variable as first column so igraph recognizes it as an id variable
nodes <- nodes[,c(3,1,2)]
net <- graph_from_data_frame(d= links, vertices = nodes, directed = T)
#visIgraph(net)
# Create column that counts the number of connections and name it so visNetwork recognizes it as sizing variable
nodes$value <- strength(net)
# Change name of groups to participation or not and to factor class to correctly color the graph accordingly
nodes$group <- ifelse(nodes$group == "1", "Participation", "No participation")
nodes$group <- as.factor(nodes$group)
# Use strength of links variable to assign distance of links between nodes
links$width <- links$strength
graph1 <- visNetwork(nodes, links) %>% visNodes(labelHighlightBold = TRUE) %>% visPhysics(solver = "repulsion") %>% visLayout(randomSeed = 13) %>%
visLegend() %>% visGroups(groupname = "Participation", color = "orange", shadow = list(enabled = TRUE)) %>% visGroups(groupname = "No participation", color = "purple")
graph1
graph2 <- graph1 %>% visOptions(highlightNearest = list(enabled = TRUE, algorithm = "hierarchical",
degree = list(from = 0, to = 2)))
graph2
# Create copy of current nodes and create a graph
nodes1 <- nodes # create copy of current nodes
net1 <- graph_from_data_frame(d = links, vertices = nodes1, directed = F)
# Identify clusters by computing edge betweenness which is a score of number of shortest paths through an edge
ceb <- cluster_edge_betweenness(net1)
# Create column of the new found clusters in the nodes dataframe
nodes1$group <- ceb$membership
# Create graph from it
graph3 <- visNetwork(nodes1, links) %>% visIgraphLayout()
graph3
# Return adjacency matrix of our most recently produced graph
netm <- get.adjacency(net1, attr="strength", sparse=F)
colnames(netm) <- V(net)$label
rownames(netm) <- V(net)$label
# Create ditance matrix from adjacency matrix
rowdist<-dist(netm)
# Order the distance matrix after HC cluster and save the order permutated indices as a vector
order1<-seriate(rowdist, "HC")
ord1<-get_order(order1)
# Reorder the adjacency matrix accordingly
reordmatr<-netm[ord1,ord1]
# PLot heatmap
plot_ly(z=~reordmatr, x=~colnames(reordmatr),
y=~rownames(reordmatr), type="heatmap") %>% layout(xaxis = list(title = "Suspected terrorist"),
yaxis = list(title = "Suspected terrorist"))
# reading the data
#coaloil <- read.csv2("/home/mikmo937/Desktop/VISUALIZATION/Lab5/Oilcoal.csv")
coaloil <- read.csv2("Oilcoal.csv")
# removing the empty column
coaloil <- coaloil[,-6]
plot_ly(coaloil, x=~log(Coal), y=~log(Oil), frame =~Year,size = ~Marker.size,color=~Country)%>%
add_markers(hoverinfo = 'text',text = paste(coaloil$Country,'\nOil = ', coaloil$Oil,
'\nCoal = ', coaloil$Coal))%>%
layout(title = 'Bubble chart',legend = list(title=list(text='Countries'))) %>% animation_opts(
150, easing = "linear", redraw = FALSE )# making the bubbles move more smoothly and linear
rows <- c(which(coaloil$Country == 'China'),which(coaloil$Country == 'India'))
plot_ly(coaloil[rows,], x=~log(Coal), y=~log(Oil), frame =~Year,size = ~Marker.size,color=~Country)%>%
add_markers(hoverinfo = 'text',text = paste(coaloil$Country[rows],'\nOil = ', coaloil$Oil[rows],
'\nCoal = ', coaloil$Coal[rows]))%>%
animation_opts(
100, easing = "linear", redraw = F)%>%
layout(title = 'Bubble chart for China and India',legend = list(title=list(text='Countries')))
# the calculations
coaloil$OILp <- (coaloil$Oil/(coaloil$Oil + coaloil$Coal)) * 100
# a
# Making the new dataframe
new_df <- rbind(coaloil,coaloil)
new_df <- new_df[order(new_df$Country,new_df$Year),] # ordering the data so the next step will be correct
# every second row have a 0 on the new variable
new_df$OILp <- replace(new_df$OILp, seq_along(new_df$OILp)%%2==0,0 )
# b
p <- plot_ly() %>% animation_opts(frame = 120, redraw = TRUE)%>% layout(
title = 'Proportion of fuel consumption by Country',
xaxis = list(title = 'Proportion of fuel consumption'),
yaxis = list(title = ''),
showlegend = FALSE
)
# looping in lines for every country in the data
for (country in unique(new_df$Country)) {
trace <- new_df[new_df$Country == country,]
p <- p %>% add_lines(data = trace, x = ~OILp, y = ~Country,frame = ~Year, name = country, line = list(width = 25))
}
p
p <- plot_ly() %>% animation_opts(
120, easing = "elastic", redraw = FALSE )%>%
layout(
title = 'Proportion of fuel consumption by Country',
xaxis = list(title = 'Proportion of fuel consumption'),
yaxis = list(title = ''),
showlegend = FALSE
)
# looping in lines for every country in the data
for (country in unique(new_df$Country)) {
trace <- new_df[new_df$Country == country,]
p <- p %>% add_lines(data = trace, x = ~OILp, y = ~Country,frame = ~Year, name = country, line = list(width = 25))
}
p
# creating a new df
coal_df <- dcast(coaloil[,c(1:3)],Year~Country, value.var = 'Coal')
mat <- rescale(coal_df[,c(2:9)])
set.seed(145)
tour <- new_tour(mat, guided_tour(cmass), NULL)
steps <- c(0, rep(1/15, 259))
Projs<-lapply(steps, function(step_size){
step <- tour(step_size)
if(is.null(step)) {
.GlobalEnv$tour<- new_tour(mat, guided_tour(cmass), NULL)
step <- tour(step_size)
}
step
}
)
# projection of each observation
tour_dat <- function(i) {
step <- Projs[[i]]
proj <- center(mat %*% step$proj)
data.frame(x = proj[,1], y = proj[,2],Year= coal_df$Year)
}
# projection of each variable's axis
proj_dat <- function(i) {
step <- Projs[[i]]
data.frame(
x = step$proj[,1], y = step$proj[,2], variable = colnames(coal_df[c(2:9)])
)
}
stepz <- cumsum(steps)
# tidy version of tour data
tour_dats <- lapply(1:length(steps), tour_dat)
tour_datz <- Map(function(x, y) cbind(x, step = y), tour_dats, stepz)
tour_dat <- dplyr::bind_rows(tour_datz)
# tidy version of tour projection data
proj_dats <- lapply(1:length(steps), proj_dat)
proj_datz <- Map(function(x, y) cbind(x, step = y), proj_dats, stepz)
proj_dat <- dplyr::bind_rows(proj_datz)
ax <- list(
title = "", showticklabels = FALSE,
zeroline = FALSE, showgrid = FALSE,
range = c(-1.1, 1.1)
)
# for nicely formatted slider labels
options(digits = 3)
tour_dat <- highlight_key(tour_dat, ~Year, group = "A")
tour <- proj_dat %>%
plot_ly(x = ~x, y = ~y, frame = ~step, color = I("black")) %>%
add_segments(xend = 0, yend = 0, color = I("gray80")) %>%
add_text(text = ~variable) %>%
add_markers(data = tour_dat, text = ~Year, ids = ~Year, hoverinfo = "text") %>%
layout(xaxis = ax, yaxis = ax)#%>%animation_opts(frame=0, transition=0, redraw = F)
tour